
xml: add preserve_namespaces option to retain XML namespace prefixes #4318

Open
ankit481 wants to merge 1 commit into redpanda-data:main from ankit481:fix/xml-preserve-namespaces


ankit481 commented Apr 20, 2026

Summary

Fixes #3928.

The xml processor and parse_xml bloblang method drop namespace prefixes on element and attribute names during XML→JSON conversion (<dc:title> becomes the key title), which makes the original XML impossible to reconstruct from the JSON output. The root cause is that github.com/clbanning/mxj always uses xml.Name.Local, so there is no flag on the existing call path that preserves prefixes.

This PR adds an opt-in preserve_namespaces boolean to both surfaces (processor field + bloblang param, default false). When enabled, parsing routes through a small stdlib-only walker in internal/impl/xml/package.go that:

  • Maintains a URI→prefix scope as it descends, seeded from xmlns:* declarations on each element.
  • Emits element and attribute keys as prefix:local when the namespace URI is known, falling back to the raw prefix (which Go's xml.Decoder leaves in Name.Space when a prefix is used but never declared).
  • Keeps xmlns:* declarations as -xmlns:prefix attributes so the original XML is reconstructable.
  • Preserves the existing output shape otherwise (-attr for attributes, #text for mixed content, arrays for repeated elements), and mirrors mxj's cast order (int → uint → float → bool) for the cast option.
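The bullets above can be sketched roughly as follows. This is a minimal illustration of the URI→prefix scope idea using only encoding/xml, not the code in this PR: the names qname, walk, parse and sample are hypothetical, the cast option is left out, and a URI bound to two prefixes simply resolves to the most recently declared one.

```go
package main

import (
	"encoding/xml"
	"fmt"
	"strings"
)

// qname rebuilds "prefix:local"; sc maps a namespace URI to its current prefix.
// Declarations such as xmlns:p arrive with Space "xmlns" and take the fallback.
func qname(n xml.Name, sc map[string]string) string {
	if n.Space == "" {
		return n.Local
	}
	if p, ok := sc[n.Space]; ok && p != "" {
		return p + ":" + n.Local
	} else if ok {
		return n.Local // default namespace: there is no prefix to restore
	}
	return n.Space + ":" + n.Local // undeclared prefix, left raw by the decoder
}

// walk converts the element opened by start into a key and a string or map value.
func walk(d *xml.Decoder, start xml.StartElement, parent map[string]string) (string, any, error) {
	sc := make(map[string]string, len(parent))
	for uri, p := range parent {
		sc[uri] = p // copy, so sibling elements never see each other's bindings
	}
	obj := map[string]any{}
	for _, a := range start.Attr { // pass 1: xmlns declarations extend the scope
		if a.Name.Space == "xmlns" {
			sc[a.Value] = a.Name.Local // xmlns:p="uri"
		} else if a.Name.Space == "" && a.Name.Local == "xmlns" {
			sc[a.Value] = "" // xmlns="uri"
		}
	}
	for _, a := range start.Attr { // pass 2: emit all attributes, declarations included
		obj["-"+qname(a.Name, sc)] = a.Value
	}
	name := qname(start.Name, sc)
	var text strings.Builder
	for {
		tok, err := d.Token()
		if err != nil {
			return "", nil, err
		}
		switch t := tok.(type) {
		case xml.StartElement:
			k, v, err := walk(d, t, sc)
			if err != nil {
				return "", nil, err
			}
			if prev, seen := obj[k]; !seen {
				obj[k] = v
			} else if arr, ok := prev.([]any); ok {
				obj[k] = append(arr, v) // third and later repeats
			} else {
				obj[k] = []any{prev, v} // second occurrence becomes an array
			}
		case xml.CharData:
			text.Write(t)
		case xml.EndElement:
			if s := strings.TrimSpace(text.String()); len(obj) == 0 {
				return name, s, nil // plain text element
			} else if s != "" {
				obj["#text"] = s
			}
			return name, obj, nil
		}
	}
}

// parse converts the first element of src into a single-key map.
func parse(src string) (map[string]any, error) {
	d := xml.NewDecoder(strings.NewReader(src))
	for {
		tok, err := d.Token()
		if err != nil {
			return nil, err
		}
		if se, ok := tok.(xml.StartElement); ok {
			k, v, err := walk(d, se, map[string]string{})
			return map[string]any{k: v}, err
		}
	}
}

const sample = `<root xmlns:dc="http://my.namespace/dc">
  <dc:title>This is a title</dc:title>
  <dc:description tone="boring">This is a description</dc:description>
</root>`

func main() {
	out, err := parse(sample)
	fmt.Println(out, err)
}
```

Copying the scope per element is the simplest way to make a nested redeclaration of a prefix apply only to that subtree; a stack of deltas would avoid the copies at the cost of more bookkeeping.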

Default behaviour is unchanged, so this is non-breaking for existing users.

Why not wait for upstream Go

The issue links Go CL 355353, which would add namespace-prefix retention to encoding/xml. That CL has been stalled for ~4 years. We don't actually need it: xml.Decoder already exposes the resolved URI in Name.Space and reports xmlns:* declarations as attributes, so a ~70 LOC walker can reconstruct prefixes today.
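To make the claim about xml.Decoder concrete, here is a small standalone sketch that lists what the decoder reports for each start element and attribute. The helper name describe and the sample input are illustrative only and are not part of this PR.

```go
package main

import (
	"encoding/xml"
	"fmt"
	"io"
	"strings"
)

// describe returns one line per start element and attribute, showing the
// Name.Space and Name.Local values as xml.Decoder reports them.
func describe(src string) ([]string, error) {
	d := xml.NewDecoder(strings.NewReader(src))
	var lines []string
	for {
		tok, err := d.Token()
		if err == io.EOF {
			return lines, nil
		}
		if err != nil {
			return nil, err
		}
		se, ok := tok.(xml.StartElement)
		if !ok {
			continue // skip character data, end elements, comments
		}
		lines = append(lines, fmt.Sprintf("elem space=%q local=%q", se.Name.Space, se.Name.Local))
		for _, a := range se.Attr {
			lines = append(lines, fmt.Sprintf("attr space=%q local=%q", a.Name.Space, a.Name.Local))
		}
	}
}

func main() {
	lines, err := describe(`<root xmlns:dc="http://my.namespace/dc"><dc:title>t</dc:title><zz:x/></root>`)
	if err != nil {
		panic(err)
	}
	for _, l := range lines {
		fmt.Println(l)
	}
}
```

For this input the declaration shows up as an attribute with space "xmlns" and local "dc", the dc:title element carries the resolved URI in Name.Space, and the undeclared zz prefix is left in Name.Space as-is. Those three facts are all a walker needs to rebuild prefixes.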

PoC results — validation on the issue's exact input

Before implementing, I validated three paths on the issue's XML:

<root xmlns:dc="http://my.namespace/dc" xmlns:ot="http://my.namespace/ot">
  <dc:title>This is a title</dc:title>
  <dc:description tone="boring">This is a description</dc:description>
  <ot:elements id="1">foo1</ot:elements>
  <ot:elements id="2">foo2</ot:elements>
  <ot:elements>foo3</ot:elements>
</root>

Path 1 — current mxj.NewMapXml (lossy, reproduces bug):

{
  "root": {
    "-dc": "http://my.namespace/dc",
    "-ot": "http://my.namespace/ot",
    "description": { "#text": "This is a description", "-tone": "boring" },
    "elements": [
      { "#text": "foo1", "-id": "1" },
      { "#text": "foo2", "-id": "2" },
      "foo3"
    ],
    "title": "This is a title"
  }
}

Element prefixes dropped; the xmlns: on the declarations is also lost (attributes appear as -dc / -ot rather than -xmlns:dc / -xmlns:ot).

Path 2 — mxj.NewMapXmlSeq (2-LOC change, but heavy output shift):

{
  "root": {
    "#attr": {
      "xmlns:dc": { "#seq": 0, "#text": "http://my.namespace/dc" },
      "xmlns:ot": { "#seq": 1, "#text": "http://my.namespace/ot" }
    },
    "dc:description": { "#attr": { "tone": { "#seq": 0, "#text": "boring" } }, "#seq": 1, "#text": "This is a description" },
    "dc:title": { "#seq": 0, "#text": "This is a title" },
    "ot:elements": [
      { "#attr": { "id": { "#seq": 0, "#text": "1" } }, "#seq": 2, "#text": "foo1" },
      { "#attr": { "id": { "#seq": 0, "#text": "2" } }, "#seq": 3, "#text": "foo2" },
      { "#seq": 4, "#text": "foo3" }
    ]
  }
}

Preserves prefixes but wraps every element in #attr / #seq metadata — would require every existing consumer to adapt to a new output shape.

Path 3 — custom stdlib walker (this PR):

{
  "root": {
    "-xmlns:dc": "http://my.namespace/dc",
    "-xmlns:ot": "http://my.namespace/ot",
    "dc:description": { "#text": "This is a description", "-tone": "boring" },
    "dc:title": "This is a title",
    "ot:elements": [
      { "#text": "foo1", "-id": "1" },
      { "#text": "foo2", "-id": "2" },
      "foo3"
    ]
  }
}

Prefixes preserved, output shape identical to the current default except for the added dc: / ot: / -xmlns:* keys. Round-trips cleanly to XML.

Tests

Added to internal/impl/xml/processor_test.go:

  • TestXMLPreserveNamespaces — 5 sub-cases:
    • Exact XML from issue #3928 (The XML processor omits namespaces).
    • preserve_namespaces: true on XML with no namespaces is a no-op vs. default output.
    • Nested element redeclares a prefix (xmlns:a at two levels mapping to different URIs).
    • Namespaced attribute (xsi:type).
    • Prefix used without an xmlns declaration stays literal.
  • TestXMLPreserveNamespacesWithCast — confirms cast: true applies to element and attribute values under the new path.
  • TestXMLDefaultStripsNamespacesUnchanged — regression guard: omitting the flag keeps the current lossy behaviour byte-for-byte.
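For reference, the cast order the new path is described as mirroring (int → uint → float → bool) could look roughly like the sketch below. The function name castValue and the concrete Go result types (int64, uint64, float64) are assumptions made for illustration; they are not taken from mxj or from this PR.

```go
package main

import (
	"fmt"
	"strconv"
)

// castValue tries int, then uint, then float, then bool, and falls back to
// the original string when none of them parse. Hypothetical helper.
func castValue(s string) any {
	if i, err := strconv.ParseInt(s, 10, 64); err == nil {
		return i
	}
	if u, err := strconv.ParseUint(s, 10, 64); err == nil {
		return u // only reached for values above the int64 range
	}
	if f, err := strconv.ParseFloat(s, 64); err == nil {
		return f
	}
	if b, err := strconv.ParseBool(s); err == nil {
		return b
	}
	return s
}

func main() {
	for _, s := range []string{"42", "1.5", "true", "foo1"} {
		fmt.Printf("%q -> %T\n", s, castValue(s))
	}
}
```

Order matters here: "1" and "0" are valid inputs to strconv.ParseBool, so trying bool before int would turn numeric IDs into booleans. Note also that strconv.ParseFloat accepts strings such as "NaN" and "Inf", which a stricter cast might want to exclude.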

Added to internal/impl/xml/bloblang_test.go:

  • parse_xml(preserve_namespaces: true) round-trip.
  • parse_xml() without the flag still strips prefixes (opt-in guard).

Full test run:

ok  github.com/redpanda-data/connect/v4/internal/impl/xml  0.302s

All existing tests (TestXMLCases, TestXMLWithCast, TestParseXML) continue to pass unchanged.

Test plan

  • go test ./internal/impl/xml/ passes
  • go vet ./internal/impl/xml/ clean
  • go build ./internal/impl/xml/ clean
  • Default (preserve_namespaces omitted or false) output byte-identical to previous behaviour
  • Issue's XML round-trips with preserve_namespaces: true
  • CI green on the PR

The default XML to JSON conversion drops namespace prefixes on element
and attribute names (e.g. `<dc:title>` becomes the key `title`) because
the underlying github.com/clbanning/mxj library uses only Name.Local.
This makes the original XML impossible to reconstruct from the JSON.

Adds an opt-in `preserve_namespaces` flag to both the `xml` processor
and the `parse_xml` bloblang method. When enabled, parsing routes
through a small stdlib-only walker that keeps a URI->prefix map from
`xmlns:*` declarations as it descends, so element and attribute keys
keep their `prefix:local` form and `xmlns:*` declarations are emitted
as `-xmlns:*` attributes.

Default behaviour is unchanged. Fixes redpanda-data#3928.